HBASE-26959 Brotli compression support #4353
Conversation
🎊 +1 overall
💔 -1 overall
🎊 +1 overall
Rebase and add supplemental models.
💔 -1 overall
💔 -1 overall
💔 -1 overall
💔 -1 overall
Precommit failures are all of the form
so I don't have the supplemental models changes quite right yet, even though it seems fine locally.
💔 -1 overall
The precommit failure is unrelated to this change.
I don't understand the license errors. This is NOT failing locally with "mvn clean install assembly:single -DskipTests".
Oh, perhaps b436ab4 has not been tested yet. Still shows as yellow.
💔 -1 overall
🎊 +1 overall
🎊 +1 overall
I subjected the new codec to the same IntegrationTestLoadCommonCrawl stress test scenario used for the earlier compression work and it passed.
Some questions and small nits -- otherwise, looks good.
</dependency>
<!-- native Java compression codecs -->
<dependency>
  <groupId>com.aayushatharva.brotli4j</groupId>
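For reference, a complete dependency declaration for this library would presumably take the shape below; the artifactId and the version property name are assumptions for illustration, not copied from the PR diff:

```xml
<!-- Sketch of the Brotli4j dependency; version property name is assumed -->
<dependency>
  <groupId>com.aayushatharva.brotli4j</groupId>
  <artifactId>brotli4j</artifactId>
  <version>${brotli4j.version}</version>
</dependency>
```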
Should we shade this dependency? Is there a scenario where user code and this jar find themselves on the same class path?
We have not shaded any other compression support libraries. It could be a follow-up issue to do them all if it seems like a good idea.
Actually, I don't believe shading can work. All of the compression support libraries except for aircompressor (whose raison d'être is a pure-Java implementation) extract platform-specific native shared libraries and load them dynamically at runtime. The symbols compiled on the native side of the JNI bridge cannot be relocated by shading, so the result will fail to link. I made sure we only included options that have native support for Linux x86_64 and Linux aarch64. Usually they also have support for macOS x86_64. That is the case for Brotli4j.
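The linkage problem can be illustrated with JNI's own naming scheme. The sketch below builds the (simplified) C symbol name the JVM looks up for a native method; the `nativeCreate` method name and the relocation prefix are hypothetical examples, and the mangling ignores JNI's escape sequences for underscores and non-ASCII characters:

```java
public class JniSymbolName {
  // Simplified JNI short-name mangling: "Java_" + fully-qualified class
  // name with '.' replaced by '_' + "_" + method name. (The real rules
  // also escape '_' and Unicode; omitted here for clarity.)
  static String jniSymbol(String className, String method) {
    return "Java_" + className.replace('.', '_') + "_" + method;
  }

  public static void main(String[] args) {
    // Symbol compiled into the prebuilt native library:
    System.out.println(
        jniSymbol("com.aayushatharva.brotli4j.encoder.EncoderJNI", "nativeCreate"));
    // Symbol the JVM would look up after a hypothetical shade relocation --
    // not exported by the prebuilt .so, so linking fails:
    System.out.println(
        jniSymbol("org.apache.hbase.thirdparty.brotli4j.encoder.EncoderJNI", "nativeCreate"));
  }
}
```

Because the native library exports only the original symbol, any relocation of the Java class name breaks the lookup at runtime.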
...compression-brotli/src/main/java/org/apache/hadoop/hbase/io/compress/brotli/BrotliCodec.java
...ession-brotli/src/main/java/org/apache/hadoop/hbase/io/compress/brotli/BrotliCompressor.java
...sion-brotli/src/main/java/org/apache/hadoop/hbase/io/compress/brotli/BrotliDecompressor.java
hbase-common/src/main/java/org/apache/hadoop/hbase/io/compress/Compression.java
Rebased on latest master, incorporating improvements from review feedback.
I have an approval and have addressed all review feedback. Planning to merge this evening unless there is an objection. Waiting for the latest precommit check before committing.
🎊 +1 overall
💔 -1 overall
conf.getInt(CommonConfigurationKeys.IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE_KEY,
    CommonConfigurationKeys.IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE_DEFAULT));
return size > 0 ? size : 256 * 1024; // Don't change this default
// IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE_DEFAULT is 0! We can't allow that.
Hmmm, why do we have an invalid default value?
It is Hadoop's default; IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE_DEFAULT is a Hadoop constant, and we reuse these where we can. But the value of '0' causes an error when creating the (Hadoop) compression stream, so we don't accept it here.
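A self-contained sketch of the guard being discussed; the class and method names here are illustrative, not the actual HBase code, which reads the value from the Hadoop Configuration:

```java
public class BufferSizeGuard {
  // Hadoop's IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE_DEFAULT is 0, which the
  // Hadoop compression stream constructor rejects, so treat any
  // non-positive configured size as "use 256 KiB".
  static final int FALLBACK = 256 * 1024;

  static int effectiveBufferSize(int configured) {
    return configured > 0 ? configured : FALLBACK;
  }

  public static void main(String[] args) {
    System.out.println(effectiveBufferSize(0));    // falls back to 262144
    System.out.println(effectiveBufferSize(4096)); // honored as-is
  }
}
```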
💔 -1 overall
@Apache9 I did not save the error text. It is the standard error text you get when LICENSE.vm cannot be processed. It said to add "apache license 2.0" to non_aggregated_fine if appropriate, so that is what I did, because that is true in this case. The reason this happens is the Brotli4j POMs, which do not use the exact text expected to describe the Apache 2.0 license. Nothing we can do about it except add it to non_aggregated_fine as intended, I believe. I tried to substitute for the Brotli4j model in supplemental_models but this did not work. Brotli4j is licensed with the Apache 2.0 license.
He responded upthread. |
After ingesting ~5TB of CommonCrawl data, brotli compressing at level 7 offers a few percentage points of improvement (-85% vs -82%) over zstandard, also compressing at its notion of level 7. This stands to reason: brotli falls back on its static dictionary, pre-trained on web content, before adding input to the generic/dynamic dictionary, while zstandard does not have this advantage (although it could, if the user configures a suitable pre-trained dictionary for it; see HBASE-26353). At scale that difference would be noticeable.
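As a back-of-the-envelope check of those numbers (assuming the quoted -85% and -82% reductions apply uniformly to the ~5TB ingest):

```java
public class CompressionSavings {
  // Bytes remaining after compression: input * (1 - reduction)
  static double stored(double inputTB, double reduction) {
    return inputTB * (1.0 - reduction);
  }

  public static void main(String[] args) {
    double brotli = stored(5.0, 0.85); // 0.75 TB stored at -85%
    double zstd = stored(5.0, 0.82);   // 0.90 TB stored at -82%
    // Relative reduction in stored bytes from choosing brotli: ~16.7%
    double relative = 1.0 - brotli / zstd;
    System.out.printf("brotli: %.2f TB, zstd: %.2f TB, %.1f%% fewer stored bytes%n",
        brotli, zstd, relative * 100);
  }
}
```

So a "few %" difference in compression ratio translates to roughly one sixth fewer bytes actually stored, which is why it matters at scale.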
This reverts commit df312d9.
…li4j and its natives
Rebased on master, using supplemental models to fix the LICENSE.vm issue as required. (@busbey pointed me at HBASE-26760 where another workaround like I proposed here was backed out.) |
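For context, a supplemental-models entry for this kind of license metadata fix generally takes the shape below (the standard maven-remote-resources supplement format; the exact entry added in this PR is not reproduced here, and the license name/URL shown are the canonical Apache 2.0 form):

```xml
<supplement>
  <project>
    <groupId>com.aayushatharva.brotli4j</groupId>
    <artifactId>brotli4j</artifactId>
    <licenses>
      <license>
        <name>Apache License, Version 2.0</name>
        <url>https://www.apache.org/licenses/LICENSE-2.0.txt</url>
      </license>
    </licenses>
  </project>
</supplement>
```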
🎊 +1 overall
🎊 +1 overall
🎊 +1 overall
I haven't gotten to dig into the change beyond the licensing stuff, but it seems good to go to me given the existing review and this update on license handling.
Signed-off-by: Nick Dimiduk <[email protected]>
I was about to document the compression improvements upcoming in 2.5.0 and realized we lacked support for one of the major compression formats in use today. (Arguably, brotli is not often used to compress data at rest, but it is widely supported among web browsers, HTTP servers, and clients.)
Brotli support for Java is provided by the Apache 2.0 licensed Brotli4j, available in Maven Central.
It is not difficult to add, so let's add it. There might be some advantage to opting for brotli compression when storing web data, given the precomputed static dictionary of such content included in its specification.
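Once the codec is on the classpath, enabling it on a column family would presumably look like this in the HBase shell (assuming the new algorithm name registered by this change is BROTLI; the table and family names are examples):

```ruby
create 'webdata', {NAME => 'c', COMPRESSION => 'BROTLI'}
```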